
News Headline Data
News Headline Collection
This is a news headline dataset collected over three years (January 2014 to December 2016) by our AcT crawler at the Computer Science & Engineering Department, Indian Institute of Technology, Roorkee.

The dataset contains about 12.5 million news headlines collected from the domain web pages (such as sports, entertainment, and business) of about 35 news sources.

The dataset is distributed in .sql format. It has two tables, Megatable and Source_Info, which are described below.

Megatable (contains information about headlines)

id: unique identifier for headline
newsheadline: News Headline text
start_time_stamp: Timestamp of first occurrence of news headline
end_time_stamp: Timestamp of last occurrence of news headline
URL: URL of news article for the headline
source_id: Foreign key referencing the Source_Info table
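
As an illustration of how the two timestamp fields can be used together, the sketch below computes how long each headline stayed visible (last occurrence minus first occurrence). It is only a minimal sketch: it uses an in-memory SQLite table as a stand-in for Megatable, and the column types, the timestamp format, the sample row, and the helper name headline_lifetimes are all assumptions rather than details taken from the dumps.

```python
import sqlite3

# Hypothetical in-memory stand-in for Megatable.
# Column names follow the field list above; the types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Megatable (
        id INTEGER PRIMARY KEY,
        newsheadline TEXT,
        start_time_stamp TEXT,
        end_time_stamp TEXT,
        URL TEXT,
        source_id INTEGER
    )
    """
)

# Made-up sample row, assuming 'YYYY-MM-DD HH:MM:SS' timestamps.
conn.execute(
    "INSERT INTO Megatable VALUES (?, ?, ?, ?, ?, ?)",
    (1, "Example headline", "2014-01-01 08:00:00", "2014-01-01 14:30:00",
     "http://example.com/article", 7),
)


def headline_lifetimes(connection):
    """Return (id, hours between first and last occurrence) for each headline."""
    rows = connection.execute(
        """
        SELECT id,
               (julianday(end_time_stamp) - julianday(start_time_stamp)) * 24.0
        FROM Megatable
        ORDER BY id
        """
    ).fetchall()
    return [(row_id, round(hours, 2)) for row_id, hours in rows]


lifetimes = headline_lifetimes(conn)
```

If the dumps are loaded into a different database engine, the same idea applies, but the date arithmetic would need that engine's own functions in place of SQLite's julianday.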

The Megatable is split by year into three dumps. The download links for the dumps are as follows.

Year: 2014; Headline count: 3.7M; Size: 830MB; link
Year: 2015; Headline count: 4.0M; Size: 930MB; link
Year: 2016; Headline count: 4.8M; Size: 1.10GB; link

Source_Info (contains information about domain web pages)

id: id of the domain web page
URL: URL of the domain web page
Category: Category (sports, entertainment, business, etc.) of the domain web page
Source_Info: download link
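
The relationship between the two tables can be sketched as a join that counts headlines per category. This is a hedged example, not a description of how the dumps are meant to be queried: it assumes that source_id in Megatable matches id in Source_Info, uses in-memory SQLite tables with made-up rows, and the column types and the helper name headlines_per_category are hypothetical.

```python
import sqlite3

# Hypothetical in-memory stand-ins for both tables; types and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Source_Info (
        id INTEGER PRIMARY KEY,
        URL TEXT,
        Category TEXT
    );
    CREATE TABLE Megatable (
        id INTEGER PRIMARY KEY,
        newsheadline TEXT,
        start_time_stamp TEXT,
        end_time_stamp TEXT,
        URL TEXT,
        source_id INTEGER REFERENCES Source_Info(id)
    );
    INSERT INTO Source_Info VALUES
        (1, 'http://example.com/sports', 'sports'),
        (2, 'http://example.com/business', 'business');
    INSERT INTO Megatable VALUES
        (1, 'Headline A', '2014-01-01 08:00:00', '2014-01-01 09:00:00',
         'http://example.com/a', 1),
        (2, 'Headline B', '2014-01-02 08:00:00', '2014-01-02 09:00:00',
         'http://example.com/b', 1),
        (3, 'Headline C', '2014-01-03 08:00:00', '2014-01-03 09:00:00',
         'http://example.com/c', 2);
    """
)


def headlines_per_category(connection):
    """Count headlines per category, assuming source_id matches Source_Info.id."""
    rows = connection.execute(
        """
        SELECT s.Category, COUNT(*)
        FROM Megatable AS m
        JOIN Source_Info AS s ON m.source_id = s.id
        GROUP BY s.Category
        ORDER BY s.Category
        """
    ).fetchall()
    return dict(rows)


counts = headlines_per_category(conn)
```

Because the Megatable is split by year, a query like this would need to be run against each loaded dump, or against a combined table, to cover the full three-year period.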
Using these resources
These resources are subject to a CC-BY 2.5 IN license. 

Please cite the following paper in any published work that uses these resources.